{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 神经网络的损失函数的原理\n",
    "\n",
    "\n",
    "* b站线性回归相关讲解视频\n",
    "\n",
    "https://www.bilibili.com/video/BV1QM4y167oZ/?spm_id_from=333.788&vd_source=09af776e3fe3bbe20ff8249e4a9b1581\n",
    "\n",
    "\n",
    "### 回归问题中的损失函数一般为均方误差(Mean Squared Error, MSE)\n",
    "\n",
    "误差平方和损失函数(Mean Squared Error, MSE)是回归任务中常用的损失函数,其思想是计算预测值和真实值的平方差误差。\n",
    "\n",
    "MSE 函数的表达式为:\n",
    "\n",
    "$$ MSE=\\frac{1}{n}\\sum_{i=1}^{n}(y_{pred_{i}}-y_{true_{i}})^{2}$$\n",
    "\n",
    "其中$n$是样本数量,$y_{pred_{i}}$是预测值,$y_{true_{i}}$是真实标签。\n",
    "\n",
    "MSE的主要特点:\n",
    "\n",
    "- 易于计算,代码实现简单\n",
    "- 对大误差VERY敏感,能惩罚大的误差\n",
    "- 容易落入局部最小值\n",
    "\n",
    "在PyTorch中,MSE损失函数的实现代码如下:\n",
    "这里直接使用PyTorch内置的MSELoss计算了预测值和真实值之间的MSE。参数reduction，默认是”mean“，表示求解所有样本平均的损失，也可换为”sum”，要求输出整体的损失。以及，还可以使用选项“none”，表示不对损失结果做任何聚合运算，直接输出每个样本对应的损失矩阵。\n",
    "\n",
    "\n",
    "### 二分类问题中的交叉熵损失函数(Binary Cross Entropy, BCE)\n",
    "\n",
    "\n",
    "* 极大似然估计\n",
    "\n",
    "https://blog.csdn.net/m0_70832728/article/details/138375478\n",
    "\n",
    "极大似然估计(Maximum Likelihood Estimation, MLE)是统计模型的参数估计方法,其基本思想是最大化模型对观测数据的似然度,由以下公式表达:\n",
    "\n",
    "设观测数据为 $X = {x_1, x_2, ..., x_n}$,参数为$\\theta$,概率密度函数为$p(x|\\theta)$。\n",
    "\n",
    "则似然函数为所有数据的联合概率:\n",
    "\n",
    "$$L(\\theta|X) = \\prod_{i=1}^{n}p(x_i|\\theta)$$\n",
    "\n",
    "极大似然估计的目标是找到参数$\\theta$,最大化似然函数:\n",
    "\n",
    "$$\\hat{\\theta}{MLE} = \\arg\\max{\\theta} L(\\theta | X) = \\arg\\max_{\\theta} \\prod_{i=1}^{n} p(x_i | \\theta)$$\n",
    "\n",
    "取对数可得log似然函数:\n",
    "\n",
    "$$\\ell(\\theta|X) = \\log L(\\theta|X) = \\sum_{i=1}^{n}\\log p(x_i|\\theta)$$\n",
    "\n",
    "通过最大化log似然函数可以得到MLE解。\n",
    "\n",
    "MLE的思想是找到最有可能产生观测数据的参数值,可广泛应用于统计模型的参数估计。\n",
    "\n",
    "* 二分类问题中的交叉熵损失函数\n",
    "\n",
    "https://zhuanlan.zhihu.com/p/38241764\n",
    "\n",
    "用来衡量两个概率分布之间的距离,通常用来训练二分类神经网络。\n",
    "\n",
    "其表达式为:  \n",
    "\n",
    "$$ BCE(y, \\hat{y}) = - (y \\log \\hat{y} + (1-y) \\log (1-\\hat{y})) $$\n",
    "\n",
    "这里$y$是真实标签,$ŷ$是模型预测概率。\n",
    "\n",
    "BCE的主要特点:\n",
    "\n",
    "- 输出使用Sigmoid函数,范围在[0,1]表示概率\n",
    "- 对正负样本同时起作用\n",
    "- 易于求导,便于梯度下降\n",
    "- 输出非0即1时效果最好\n",
    "\n",
    "\n",
    "### 多类交叉熵损失函数(Multi-class Cross Entropy)\n",
    "\n",
    "对于多分类问题,我们通常使用多类交叉熵损失函数(Multi-class Cross Entropy)。\n",
    "\n",
    "其计算公式为:\n",
    "\n",
    "$$ BCE(y, \\hat{y}) = - \\sum_{c=1}^My_{o,c}\\log(\\hat{y}_{o,c}) $$\n",
    "\n",
    "这里$y_o$是onehot编码的真实标签,$\\hat{y}_o$是模型输出的概率,M是类别数量。\n",
    "\n",
    "多类交叉熵的工作原理是:\n",
    "\n",
    "1. 将真实标签转化为onehot向量,只有对应类别索引位置为1。\n",
    "\n",
    "2. 模型通过Softmax输出每个类别的预测概率。\n",
    "\n",
    "3. 交叉熵度量预测概率分布与onehot标签分布的距离。\n",
    "\n",
    "4. 最小化此距离等效于最大化对的类别概率。\n",
    "\n",
    "PyTorch中多类交叉熵损失函数的实现代码:\n",
    "\n",
    "所以多类交叉熵适合多分类问题,通过onehot编码 label 并最小化预测概率与真实label 分布之间的距离来训练模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pytorch中的损失函数使用\n",
    "\n",
    "1. 均方误差损失函数(MSE Loss):用于回归问题,计算预测值与真实值之间的均方误差。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(0.6667)\n"
     ]
    },
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m在当前单元格或上一个单元格中执行代码时 Kernel 崩溃。\n",
      "\u001b[1;31m请查看单元格中的代码，以确定故障的可能原因。\n",
      "\u001b[1;31m单击<a href='https://aka.ms/vscodeJupyterKernelCrash'>此处</a>了解详细信息。\n",
      "\u001b[1;31m有关更多详细信息，请查看 Jupyter <a href='command:jupyter.viewOutput'>log</a>。"
     ]
    }
   ],
   "source": [
    "# 定义两个张量并计算其均方差损失\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "# 预测值 \n",
    "y_pred = torch.tensor([2, 3, 4], dtype = torch.float32)  \n",
    "# 真实值\n",
    "y_true = torch.tensor([1, 3, 5], dtype = torch.float32)\n",
    "# 计算 MSE\n",
    "criterion = nn.MSELoss()\n",
    "loss = criterion(y_pred, y_true)\n",
    "\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(0.6667)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#loss.backward()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(3.0311, grad_fn=<MseLossBackward0>)\n"
     ]
    }
   ],
   "source": [
    "# 在上节课回归神经网络的基础上，向前传播一次，并计算最后的均方差损失\n",
    "class Net(torch.nn.Module):\n",
    "    def __init__(self):\n",
    "        super(Net,self).__init__()\n",
    "        self.h1=torch.nn.Linear(3,3)\n",
    "        self.h2=torch.nn.Linear(3,2)\n",
    "        self.out=torch.nn.Linear(2,1)\n",
    "    def forward(self,x):\n",
    "        h1_out=self.h1(x)\n",
    "        h1_out_r=torch.relu(h1_out)\n",
    "        h2_out=self.h2(h1_out_r)\n",
    "        h2_out_r=torch.relu(h2_out)\n",
    "        out_out=self.out(h2_out_r)\n",
    "        return out_out\n",
    "net=Net()\n",
    "x=torch.tensor([[1.0,2.0,1.5],[3.0,4.0,2.3],[5.0,6.0,1.2]])\n",
    "y_true=torch.tensor([[1.0],[2.0],[3.0]])\n",
    "y_pred=net(x)\n",
    "criterion = nn.MSELoss()\n",
    "loss = criterion(y_pred, y_true)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.4478],\n",
       "        [0.4658],\n",
       "        [0.4633]], grad_fn=<AddmmBackward0>)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 二分类交叉熵损失，nn提供了两个类：BCEWithLogitsLoss以及BCELoss。\n",
    "\n",
    "虽然PyTorch官方没有直接明确，但实际上两个函数所需要输入的参数不同。\n",
    "BCEWithLogitsLoss内置了sigmoid函数与交叉熵函数，它会自动计算输入值的sigmoid值，因此需要预测值与真实标签，且顺序不能变化，预测值必须在前。相对的，BCELoss中只有交叉熵函数，没有sigmoid层，因此需要输入sigma与真实标签，且顺序不能变化。同时，这两个函数都要求预测值与真实标签的数据类型以及结构（shape）必须相同，否则运行就会报错。\n",
    "接下来，我们来看看这两个类是如何使用的：\n",
    "\n",
    "可以看出，两个类的结果是一致的。根据PyTorch官方的公告，他们更推荐使用BCEWithLogitsLoss这个内置了sigmoid函数的类。内置的sigmoid函数可以让精度问题被缩小（因为将指数运算包含在了内部），以维持算法运行时的稳定性，即是说当数据量变大、数据本身也变大时，BCELoss类产生的结果可能有精度问题。所以，当我们的输出层使用sigmoid函数时，我们就可以使用BCEWithLogitsLoss作为损失函数。与MSELoss相同，二分类交叉熵的类们也有参数reduction，默认是”mean“，表示求解所有样本平均的损失，也可换为”sum”，要求输出整体的损失。以及，还可以使用选项“none”，表示不对损失结果做任何聚合运算，直接输出每个样本对应的损失矩阵。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(0.0628)\n"
     ]
    }
   ],
   "source": [
    "# 定义张量，注意值的范围，并计算其BCE损失和BCEWithLogits损失\n",
    "\n",
    "# 预测概率 \n",
    "y_pred = torch.tensor([0.02, 0.9])  \n",
    "# 真实标签\n",
    "y_true = torch.tensor([0, 1], dtype = torch.float32) \n",
    "# 计算BCE\n",
    "criterion = nn.BCELoss()  \n",
    "loss = criterion(y_pred, y_true)\n",
    "\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(0.0635)\n"
     ]
    }
   ],
   "source": [
    "criterion2 = nn.BCEWithLogitsLoss() #内置了sigmoid函数\n",
    "loss = criterion2(torch.tensor([-2, 9.9]) , y_true)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(0.6435, grad_fn=<BinaryCrossEntropyBackward0>)\n"
     ]
    }
   ],
   "source": [
    "# 用上节课的二分类网络，向前传播并计算器的BCE损失和BCEWithLogits损失\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class Net(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(Net, self).__init__()\n",
    "        self.fc1 = nn.Linear(3, 1)\n",
    "        self.fc2 = nn.Linear(1, 1)\n",
    "    def forward(self, x):\n",
    "        x = F.relu(self.fc1(x))\n",
    "        x = torch.sigmoid(self.fc2(x))\n",
    "        return x\n",
    "net = Net()\n",
    "x=torch.tensor([[1.0,2.0,0.4],[3.0,4,0.1],[5.0,6.0,0.3]],dtype = torch.float32)\n",
    "y_true=torch.tensor([[1.0],[0.0],[1.0]],dtype = torch.float32)\n",
    "y_pred=net(x)\n",
    "criterion = nn.BCELoss()\n",
    "loss = criterion(y_pred, y_true)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. 多分类交叉熵损失函数(CrossEntropyLoss):用于多分类问题,计算预测值与真实值之间的交叉熵。\n",
    "\n",
    "* 多分类交叉熵目标函数的维度不同 y.squeeze()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(0.9602)\n"
     ]
    }
   ],
   "source": [
    "#定义适合多分类的张量，并计算其多类交叉熵损失\n",
    "# 预测概率\n",
    "y_pred = torch.tensor([[0.1, 0.2, 0.7],\n",
    "                       [0.3, 0.5, 0.2],[0.1, 0.4, 0.3]])\n",
    "# 真实标签 \n",
    "y_true = torch.tensor([2,0,1]) #由于交叉熵损失需要将标签转化为独热形式，因此不接受浮点数作为标签的输入\n",
    "\n",
    "# 计算多类交叉熵  \n",
    "criterion = nn.CrossEntropyLoss()\n",
    "loss = criterion(y_pred, y_true)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(1.1211, grad_fn=<NllLossBackward0>)\n"
     ]
    }
   ],
   "source": [
    "#用上节课多分类网络，向前传播并计算其多类交叉熵损失\n",
    "class Net(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(Net, self).__init__()\n",
    "        self.fc1 = nn.Linear(3, 3)\n",
    "        self.fc2 = nn.Linear(3, 2)\n",
    "        self.fc3 = nn.Linear(2, 3)\n",
    "    def forward(self, x):\n",
    "        x = F.relu(self.fc1(x))\n",
    "        x = F.relu(self.fc2(x))\n",
    "        x = F.softmax(self.fc3(x), dim=1)\n",
    "        return x\n",
    "net = Net()\n",
    "x=torch.tensor([[1.0,2.0,0.4],[3.0,4,0.1],[5.0,6.0,0.3]],dtype = torch.float32)\n",
    "y_true=y = torch.tensor([0, 1, 2], dtype=torch.long)\n",
    "y_pred=net(x)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "loss = criterion(y_pred, y_true)\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.4062, 0.1495, 0.4444],\n",
       "        [0.4281, 0.1621, 0.4098],\n",
       "        [0.4385, 0.1687, 0.3927]], grad_fn=<SoftmaxBackward0>)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 神经网络的反向传播"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"20.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import Image\n",
    "Image(url= \"20.png\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"16.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image(url= \"16.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 反向传播中的链式求导\n",
    "\n",
    "* 链式求导举例\n",
    "\n",
    "利用链式法则来求解 $ f(x) = (a - x)^2 $ 关于 $ x $ 的导数是一个更直接的方法。\n",
    "\n",
    "链式法则是这样的：如果有一个复合函数 $ h(x) = g(f(x)) $，那么 $ h(x) $ 的导数可以通过下面的方式得到：\n",
    "$$ h'(x) = g'(f(x)) \\cdot f'(x) $$\n",
    "\n",
    "对于函数 $ f(x) = (a - x)^2 $，我们可以将其看作是两个函数的复合：$ u = a - x $ 和 $ y = u^2 $。\n",
    "\n",
    "首先，我们找到内层函数 $ u = a - x $ 的导数：\n",
    "$$ \\frac{du}{dx} = -1 $$\n",
    "\n",
    "接着，我们找到外层函数 $ y = u^2 $ 关于 $ u $ 的导数：\n",
    "$$ \\frac{dy}{du} = 2u $$\n",
    "\n",
    "现在，将两者组合起来，应用链式法则：\n",
    "$$ \\frac{dy}{dx} = \\frac{dy}{du} \\cdot \\frac{du}{dx} $$\n",
    "$$ \\frac{dy}{dx} = 2u \\cdot (-1) $$\n",
    "$$ \\frac{dy}{dx} = -2(a - x) $$\n",
    "\n",
    "所以，$ (a - x)^2 $ 对 $ x $ 的导数是 $ -2(a - x) $，或者写成与之前结果一致的形式 $ 2(x - a) $。\n",
    "\n",
    "\n",
    "* 图文过程请参考如下链接：\n",
    "\n",
    "https://www.mladdict.com/neural-network-simulator\n",
    "\n",
    "反向传播(Backpropagation)是训练神经网络的重要算法,它的计算过程和原理可以用以下公式表达:\n",
    "\n",
    "我们假设一个简单的多层前馈神经网络,包含输入层x,隐藏层a,输出层y,以及对应的权重矩阵W和偏置b。\n",
    "\n",
    "那么这个网络的前向传播可以表示为:\n",
    "\n",
    "隐藏层:\n",
    "$$ z^{(2)} = W^{(2)} x $$ \n",
    "$$ a^{(2)} = \\sigma(z^{(2)}) $$\n",
    "\n",
    "输出层:\n",
    "$$ z^{(3)} = W^{(3)} a^{(2)}$$\n",
    "$$ a^{(3)} = \\sigma(z^{(3)}) $$\n",
    "\n",
    "其中f表示激活函数。\n",
    "\n",
    "在反向传播中,我们以损失函数L为参考,计算每个变量对L的梯度贡献:\n",
    "\n",
    "1. 输出层权重的梯度:\n",
    "$$ \\frac{\\partial L}{\\partial W^{(3)}} = \\frac{\\partial L}{\\partial a^{(3)}} \\frac{\\partial a^{(3)}}{\\partial z^{(3)}} \\frac{\\partial z^{(3)}}{\\partial W^{(3)}} $$\n",
    "\n",
    "\n",
    "2. 隐藏层权重的梯度:\n",
    "$$ \\frac{\\partial L}{\\partial W^{(2)}} = \\frac{\\partial L}{\\partial z^{(3)}} \\frac{\\partial z^{(3)}}{\\partial a^{(2)}} \\frac{\\partial a^{(2)}}{\\partial z^{(2)}} \\frac{\\partial z^{(2)}}{\\partial W^{(2)}}$$\n",
    "\n",
    "由此可更新每个参数,使loss函数L最小化。\n",
    "\n",
    "反向传播自动进行链式法则的梯度计算,实现了有效的深度网络训练。这是它被广泛采用的关键原因。\n",
    "\n",
    "* sigmoid函数的求导\n",
    "\n",
    "sigmoid函数是一个常用的激活函数，其数学形式为：\n",
    "\n",
    "$$ \n",
    "    \\sigma(x) = \\frac{1}{1 + e^{-x}} \n",
    "    $$\n",
    "\n",
    "sigmoid函数的导数可以用来计算神经网络中的梯度，这对于反向传播算法来说是非常重要的。sigmoid函数的导数具有一个非常方便的形式，可以表示为：\n",
    "\n",
    "$$ \\sigma'(x) = \\sigma(x) (1 - \\sigma(x)) $$\n",
    "\n",
    "这个导数的性质是它在sigmoid函数的输出上自然地产生了调节作用。当sigmoid函数的输出接近0或1时，导数的值也接近0，这意味着在这些区域的变化率很小；而在输出接近0.5时，导数达到最大值，变化率最大。\n",
    "\n",
    "下面是推导过程的一个简要概述：\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\sigma(x) & = \\frac{1}{1 + e^{-x}} \\\\\n",
    "\\sigma'(x) & = \\frac{d}{dx} \\left( \\frac{1}{1 + e^{-x}} \\right) \\\\\n",
    "& = \\frac{e^{-x}}{(1 + e^{-x})^2} \\\\\n",
    "& = \\frac{1}{1 + e^{-x}} \\cdot \\frac{e^{-x}}{1 + e^{-x}} \\\\\n",
    "& = \\sigma(x) \\cdot (1 - \\sigma(x))\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "这个简单的导数形式使得sigmoid函数在早期的神经网络中非常流行。然而，如前所述，sigmoid函数在远离原点时的梯度消失问题限制了其在深层网络中的应用，因此现在更常使用像ReLU这样的激活函数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([3.6598, 5.7086])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#定义一个张量的计算过程，并通过backward计算其梯度\n",
    "x=torch.tensor([[1.0,2.0],[3.4,2.1]],dtype = torch.float32)\n",
    "w=torch.tensor([-0.0588,  0.6530],requires_grad=True)\n",
    "y=torch.tensor([0.0,1.0],dtype = torch.float32)\n",
    "f=(torch.mv(x,w)-y)**2\n",
    "f=f.sum()\n",
    "f.backward()\n",
    "w.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(1.5849)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "x.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "w=torch.tensor([-0.0588,  0.6530],requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([3.6598, 5.7086])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(1.5849, grad_fn=<SumBackward0>)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用loss.backward()完成上面多分类网络的求导\n",
    "class Model0(nn.Module):\n",
    "    def __init__(self,in_features=10,out_features=2):\n",
    "        super(Model0,self).__init__()\n",
    "        self.h1=nn.Linear(3,3,bias=True)\n",
    "        self.h2=nn.Linear(3,2,bias=True)\n",
    "        self.out=nn.Linear(2,3,bias=True)\n",
    "\n",
    "    def forward(self, x):\n",
    "        h1_out=self.h1(x)\n",
    "        h1_out_r=torch.relu(h1_out)\n",
    "        h2_out=self.h2(h1_out_r)\n",
    "        h2_out_r=torch.relu(h2_out)\n",
    "        out_out=self.out(h2_out_r)\n",
    "        out_out_s=torch.softmax(out_out,dim=1)\n",
    "        return out_out_s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "x=torch.tensor([[1.0,2.0,1.5],[3.0,4.0,2.3],[5.0,6.0,1.2]])\n",
    "z=torch.tensor([0,2,1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "tensor([[-2.3843e-02, -7.0666e-03,  4.7106e-02],\n",
       "        [-2.0805e-04, -6.1661e-05,  4.1103e-04],\n",
       "        [-2.0353e-02, -6.0320e-03,  4.0209e-02]])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#实例化网络并完成一次前向传播和求导\n",
    "net=Model0(3,3)\n",
    "#net(x)\n",
    "output=net.forward(x)\n",
    "#print(output)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "loss = criterion(output,z)#在PyTorch中,有些情况下目标向量(target)需要使用.long()方法转换为长整型tensor(long tensor),\n",
    "print(net.h1.weight.grad)\n",
    "loss.backward()\n",
    "net.h1.weight.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0.3665, 0.3365, 0.2970],\n",
       "        [0.3359, 0.3670, 0.2972],\n",
       "        [0.2837, 0.4226, 0.2937]], grad_fn=<SoftmaxBackward0>)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 梯度下降算法\n",
    "\n",
    "\n",
    "### 梯度向量\n",
    "\n",
    "梯度向量的方向和大小确定方式可以用以下公式解释:\n",
    "\n",
    "假设有目标函数$f(\\mathbf{x})$,其中$\\mathbf{x}$是$n$维向量。\n",
    "\n",
    "则目标函数$f$相对于$\\mathbf{x}$的梯度是一个向量:\n",
    "\n",
    "$$\\nabla_{\\mathbf{x}} f = \\begin{bmatrix} \\frac{\\partial f}{\\partial x_1} \\\\ \\frac{\\partial f}{\\partial x_2} \\\\ \\vdots \\\\ \\frac{\\partial f}{\\partial x_n} \\end{bmatrix}$$\n",
    "\n",
    "其中:\n",
    "\n",
    "- $\\frac{\\partial f}{\\partial x_i}$表示$f$相对于$x_i$的偏导数\n",
    "\n",
    "- 梯度向量的每个元素表示目标函数相对于对应变量的变化率\n",
    "\n",
    "梯度向量方向性质:\n",
    "\n",
    "- 梯度方向与目标函数增长最快的方向相反\n",
    "\n",
    "- 沿负梯度方向移动可以使目标函数值下降最快\n",
    "\n",
    "梯度向量大小性质:\n",
    "\n",
    "- 梯度大小表示目标函数沿该方向的变化率\n",
    "\n",
    "- 梯度越大,目标函数越“陡峭”\n",
    "\n",
    "所以梯度向量充分表示了目标函数的局部变化,方向和大小的合理利用都有利于设计优化算法。\n",
    "\n",
    "\n",
    "\n",
    "### 对于权重参数的梯度下降的更新公式\n",
    "\n",
    "假设一个优化问题,其目标是最小化损失函数$L(\\omega)$,其中ω表示模型的参数。\n",
    "\n",
    "则梯度下降法的迭代更新公式为:\n",
    "\n",
    "$$\\omega^{(t+1)} = \\omega^{(t)} - \\eta \\nabla_{\\omega} L(\\omega^{(t)})$$\n",
    "\n",
    "其中:\n",
    "\n",
    "- $\\omega^{(t)}$: 模型参数在第$t$次迭代的值  \n",
    "- $\\eta$: 学习率,确定更新步长\n",
    "- $\\nabla_{\\omega} L(\\omega^{(t)})$: $L$相对于$\\omega$的梯度向量\n",
    "- $t$: 迭代次数\n",
    "\n",
    "梯度下降的迭代过程是:\n",
    "\n",
    "1. 初始化参数$\\omega$ \n",
    "\n",
    "2. 计算损失函数$L(\\omega)$关于参数$\\omega$的梯度$\\nabla_{\\omega} L(\\omega)$\n",
    "\n",
    "3. 沿负梯度方向更新参数,步长为学习率$\\eta$\n",
    "\n",
    "4. 重复2-3,直到损失函数收敛或达到预定迭代次数\n",
    "\n",
    "\n",
    "## pytorch中的求导与梯度下降实现\n",
    "\n",
    "1. backward()函数的求导机制\n",
    "2. requires_grad=True 与 grad_fn 的作用\n",
    "\n",
    "* loss.backward()\n",
    "* net.h1.weight.grad\n",
    "* net.h1.weight.data\n",
    "* lr  dw  w  gamma\n",
    "* v = torch.zeros(dw.shape[0],dw.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Parameter containing:\n",
       "tensor([[ 0.2992,  0.3392,  0.3109],\n",
       "        [-0.1067,  0.2977,  0.1611],\n",
       "        [-0.3067,  0.3249,  0.0872]], requires_grad=True)"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.h1.weight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[-2.3843e-02, -7.0666e-03,  4.7106e-02],\n",
       "        [-2.0805e-04, -6.1661e-05,  4.1103e-04],\n",
       "        [-2.0353e-02, -6.0320e-03,  4.0209e-02]])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.h1.weight.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[0., 0., 0.],\n",
       "        [0., 0., 0.],\n",
       "        [0., 0., 0.]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取net.h1.weight.grad与net.h1.weight.data 并手动进行梯度下降更新\n",
    "dw=net.h1.weight.grad\n",
    "torch.zeros(dw.shape[0],dw.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[ 0.2992,  0.3392,  0.3109],\n",
      "        [-0.1067,  0.2977,  0.1611],\n",
      "        [-0.3067,  0.3249,  0.0872]])\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "tensor([[ 0.3016,  0.3399,  0.3062],\n",
       "        [-0.1067,  0.2977,  0.1610],\n",
       "        [-0.3047,  0.3255,  0.0832]])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dw=net.h1.weight.grad#+lr*dw1\n",
    "lr=0.1\n",
    "w0=net.h1.weight.data\n",
    "print(w0)\n",
    "w1=w0-lr*dw\n",
    "w1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"21.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image(url= \"21.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 动量法梯度下降\n",
    "\n",
    "基于动量法(Momentum)的梯度下降可以使用动量项加速学习,提高收敛速度。其公式表达如下:\n",
    "\n",
    "假设参数为$\\omega$,损失函数为$J(\\omega)$,学习率为$\\eta$。\n",
    "\n",
    "基础梯度下降更新:\n",
    "\n",
    "$$\\omega_{t+1} = \\omega_{t} - \\eta \\nabla_\\omega J(\\omega_{t}) $$\n",
    "\n",
    "引入动量项$\\nu$,动量参数为$\\gamma$:\n",
    "\n",
    "$$\\nu_{t+1} = \\gamma \\nu_{t} + \\eta \\nabla_\\omega J(\\omega_{t})$$\n",
    "\n",
    "$$\\omega_{t+1} = \\omega_{t} - \\nu_{t+1}$$\n",
    "\n",
    "动量法的思想:\n",
    "\n",
    "- 使用$\\nu$存储历史梯度作为动量\n",
    "- $\\nu$平滑更新,抑制梯度震荡\n",
    "- $\\gamma$控制历史梯度的使用比例\n",
    "\n",
    "优点:\n",
    "\n",
    "- 加速学习,提高收敛速度\n",
    "- 有效抑制共线性等问题造成的震荡\n",
    "\n",
    "动量法利用历史梯度产生动能,帮助优化器逃离局部最优,是改进梯度下降的常用技巧。\n",
    "\n",
    "* net.h1.weight.grad\n",
    "* net.h1.weight.data\n",
    "* lr  dw  w  gamma\n",
    "* v = torch.zeros(dw.shape[0],dw.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([[ 0.3016,  0.3399,  0.3062],\n",
       "        [-0.1067,  0.2977,  0.1610],\n",
       "        [-0.3047,  0.3255,  0.0832]])"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 假设前一步梯度为v,则当前梯度为dw,则当前动量v1\n",
    "lr=0.1\n",
    "gamma=0.5\n",
    "v = torch.zeros(dw.shape[0],dw.shape[1])\n",
    "dw=net.h1.weight.grad\n",
    "w0=net.h1.weight.data\n",
    "v1=v*gamma+lr*dw\n",
    "w1=w0-v1\n",
    "w1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### torch.optim\n",
    "\n",
    "https://zhuanlan.zhihu.com/p/346205754\n",
    "\n",
    "* 基本用法\n",
    "\n",
    "优化器主要是在模型训练阶段对模型可学习参数进行更新, 常用优化器有 SGD，RMSprop，Adam等\n",
    "\n",
    "优化器初始化时传入传入模型的可学习参数，以及其他超参数如 lr，momentum等\n",
    "\n",
    "在训练过程中先调用 optimizer.zero_grad() 清空梯度，再调用 loss.backward() 反向传播，最后调用 optimizer.step()更新模型参数\n",
    "\n",
    "* 父类Optimizer 基本原理\n",
    "\n",
    "Optimizer 是所有优化器的父类，它主要有如下公共方法:\n",
    "\n",
    "add_param_group(param_group): 添加模型可学习参数组\n",
    "\n",
    "step(closure): 进行一次参数更新\n",
    "\n",
    "zero_grad(): 清空上次迭代记录的梯度信息\n",
    "\n",
    "state_dict(): 返回 dict 结构的参数状态\n",
    "\n",
    "load_state_dict(state_dict): 加载 dict 结构的参数状态\n",
    "\n",
    "* optim.SGD(net.parameters() , lr=lr , momentum = gamma)\n",
    "* opt.zero_grad()\n",
    "* loss.backward()\n",
    "* opt.step()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "x=torch.tensor([[1.0,2.0,1.5],[3.0,4.0,2.3],[5.0,6.0,1.2]])\n",
    "z=torch.tensor([0,2,1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method Module.parameters of Model0(\n",
       "  (h1): Linear(in_features=3, out_features=3, bias=True)\n",
       "  (h2): Linear(in_features=3, out_features=2, bias=True)\n",
       "  (out): Linear(in_features=2, out_features=3, bias=True)\n",
       ")>"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor(1.1060, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.1052, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.1041, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.1029, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.1017, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.1005, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.0998, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.0998, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.0997, grad_fn=<NllLossBackward0>)\n",
      "tensor(1.0997, grad_fn=<NllLossBackward0>)\n"
     ]
    }
   ],
   "source": [
    "# 续写代码完成上面网络的循环训练\n",
    "lr=0.1\n",
    "gamma=0.5\n",
    "\n",
    "net=Model0(3,3)\n",
    "#print(output)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "opt=optim.SGD(net.parameters() , lr=lr , momentum = gamma)\n",
    "\n",
    "for i in range(10):\n",
    "    output=net.forward(x)\n",
    "    loss = criterion(output,z)#在PyTorch中,有些情况下目标向量(target)需要使用.long()方法转换为长整型tensor(long tensor),\n",
    "    opt.zero_grad()\n",
    "    loss.backward()\n",
    "    opt.step()\n",
    "    print(loss)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
